Catalog Matching with Astrometric Correction and its 
Application to the Hubble Legacy Archive 



Tamas Budavari 1 and Stephen H. Lubow 2 

1 The Johns Hopkins University, 3400 N. Charles St., Baltimore, MD 21218 
2 Space Telescope Science Institute, 3700 San Martin Dr., Baltimore, MD 21218 

budavari@jhu.edu, lubowOstsci .edu 
ABSTRACT 

Object cross-identification in multiple observations is often complicated by the 
uncertainties in their astrometric calibration. Due to the lack of standard ref- 
erence objects, an image with a small field of view can have significantly larger 
errors in its absolute positioning than the relative precision of the detected sources 
within. We present a new general solution for the relative astrometry that quickly 
refines the World Coordinate System of overlapping fields. The efficiency is ob- 
tained through the use of infinitesimal 3-D rotations on the celestial sphere, 
which do not involve trigonometric functions. They also enable an analytic so- 
lution to an important step in making the astrometric corrections. In cases with 
many overlapping images, the correct identification of detections that match to- 
gether across different images is difficult to determine. We describe a new greedy 
Bayesian approach for selecting the best object matches across a large number of 
overlapping images. The methods are developed and demonstrated on the Hub- 
ble Legacy Archive, one of the most challenging data sets today. We describe a 
novel catalog compiled from many Hubble Space Telescope observations, where 
the detections are combined into a searchable collection of matches that link the 
individual detections. The matches provide descriptions of astronomical objects 
involving multiple wavelengths and epochs. High relative positional accuracy of 
objects is achieved across the Hubble images, often sub-pixel precision in the 
order of just a few milli-arcseconds. The result is a reliable set of high-quality 
associations that are publicly available online. 



Subject headings: catalogs — astrometry — methods: statistical 
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Introduction 



Astronomical research is benefiting from the use of large searchable catalogs made possible by 
advances in detector and computer technology. With the rapidly growing capacity of CCDs, 
astronomical objects are being detected at an increasingly fast rate. A major challenge to 
astronomy is being able to synthesize this large amount of data into an organized set of 
information that takes the form of a catalog. The catalog describes not only the properties 
of the images, but more importantly the sources detected within them. In its simplest usage, 
one can quickly determine whether an object is found in some region of space and examine 
its listed properties. But there are many other possible uses of a large catalog. For example, 
non-positional (all sky) searches for objects with particular properties can provide statistical 
information on large numbers of objects. Objects that lie at the extremes of the distributions 
provide the potential for new discoveries. 

Advances in relational database technology have provided a means for building, storing, 
and searching large catalogs. This technology alone, however, is not sufficient. One reason 
is that the rate at which data is becoming available from detectors is on a path to exceed 
stora ge and processing capabilities of computers for a large class of problems ( jSzalay fc Gray 
20011 ). New approaches to data organization and new algorithms are needed to deal with 
the challenges of a telescope such as Hubble Space Telescope (HST). In the case of the HST, 
some of these arise from its small field of view and the complex geometry of its overlapping 
exposures. 

Crossmatching of detected sources from images taken at different wavelengths and at 
different epochs is a crucial capability for catalog construction. By matching source de- 
tections across multiple images to a single astronomical object, one can then determine 
spectral and temporal pr operties of the object. The statistical methodology introduced by 
Budavari fc Szalayl (120081 ) provides a clean framework for determining symmetric n-way as 



sociations, and has bee n successfully applied i n several studies includin g iHeinis et al. 



ap 

Roseboom et all f l2009t ). lKerekes et ailfhoiOMRots fc Budavari! f l20 111 ) and lBudavaril fcOllf ). 



(2009), 



Performing crossmatching on HST observations also involves adjusting the positions of the 
images to place them into better alignment. In practice, these two are related, since the 
accuracy of the alignment is determined by how well the sources match together. 

One approach to crossmatc hing involves registe ring images against a known catalog, such 
as Guide Star C atalog (GSC II; lLasker et al.l 120081 ) or the Sloan Digital Sky Survey (SDSS; 
York et al. 2000 ) SkyServer database^ In this approach, the absolute position information 



1 Visit the SkyServer online at 



http : //skyserver . sdss . org/ 
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in the catalog is used to anchor positions of matching sources detected in the images. Sources 
from different images can then be compared. The drawback to this method is that few or 
none of the sources in the catalog may be in the image. This situation is particularly true 
of images taken by the HST which has a small field of view. On the other hand, it often 
detects many more sources than have been previously detected. A different approach, the 
one we adopt here, is to crossmatch the many sources in overlapping images taken by HST 
against each other, rather than against an existing catalog. The process involves adjusting 
the positions of the overlapping images to improve the residual errors in the crossmatching. 
In this case, relative, rather than absolute, astrometry is determined involving many detected 
sources. Absolute astrometric positioning can then later be determined by matching the set 
of overlapping images as a larger unit against the absolute standards. 

Some astronomical projects, such the SDSS, were designed with the goal of providing a 
catalog. The observations are made in a way that uniformly tiles regions of sky in certain filter 
bands and at certain regular time intervals. On the other hand, the HST as well as other 
major space observatories, have generally targeted particular sources, although particular 
programs have undertaken surveys for very limited regions of the sky. For more than two 
decades, images have been taken for many independent programs, resulting in a sparse, 
complex geometry of sky coverage and in irregular time intervals between observations of 
objects. In many cases, the coverage involves partially overlapping exposures with a variety 
of outline sizes and shapes, orientations, filters, and exposure times. The HST has provided 
some of the highest resolution images ever obtained. Therefore, in spite of the challenges, 
there is a potentially important scientific gain by having a catalog for HST. 

The Hubble Legacy Archival (hereafter HLA; Jenkner et al. 2006 ) provides enhanced 
data products and advanced browsing capabilit ies online. Its produc t s include lists of de - 
tected sources and their photometric properties^ ( Whitmore et al. 2008 ; Lindsay et al. 2010 ). 



These source lists are obtained running DAOPhot flStetsonl 11987m " and Source Extractor 



( Bertin &: Arnoutsl 1996 ) software on combined, drizzled images. Each of these images is 
the result of combining exposures for a single instrument, detector and filter from a single 
visit (pointing of HST). These source lists contain information about source positions, fluxes, 
magnitudes, morphology, etc. The source lists and auxiliary HLA data provide the needed 
input information to implement our algorithms for the Hubble catalog. By crossmatching 
the source lists based on these images, we obtain multi-wavelength time-domain information 
about astronomical objects. 



2 Visit the Hubble Legacy Archive at http://hla.stsci.edu/ 

3 See source list description at http://hla.stsci.edU/hla_faq.html#Sourcel 
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In this paper we describe some novel approaches to crossmatching with application to 
the construction of a catalog for HST. In Section [2] we describe general algorithms for image 
clustering, astrometric correction, and source aggregation into matches. Section [3] describes 
the pipeline that we have constructed to build an HST catalog based on these algorithms. 
Section H] contains an analysis of the properties of this catalog for ACS/WFC and WFPC2. 
Section [5] concludes our study. 



2. Improved Matching and Astrometry 



To create a reliable set of associations, we nee d high-precision astrometry. Solutions exist 
for the global World-Coordinate System (WCS; iGreisen fc Calabrettal 120021 ) astrometric de- 



termination of large images (e.g.. lHogg et al.ll2008l ). However, small images are still difficult 
to work with, especially when the sources are very faint, since they typically do not contain 
a sufficient number of calibration standard objects. The only possibility is to cross-calibrate 
the set of small overlapping images to obtain improved relative astrometry. Once several 
of the small images are locked in and tied together, the number of available standards will 
increase that enables a more accurate absolute positioning. First we discuss how to cross- 
calibrate two or more images to improve their relative accuracy. Later in the section, we 
describe a Bayesian method that determines the matching sources from sets of many nearby 
detections in overlapping images. 



2.1. Finding Pairs 



Crossmatching millions of sources contained in many irregularly placed images on the sky 
is a computationally challenging task. A substantial reduction in computational overhead 
is obtained by determining disjoint sets of overlapping images through a friends-of-friends 
(FoF) algorithm. We use the term mosaic to describe each of these disjoint sets. Since the 
mosaics are disjoint, the source matching is carried out independently within each and every 
one of them. The first step in creating sets of matched sources involves the identification 
of pairs of sources that reside in different images and are close together on the celestial 
sphere. A tolerance is then needed to be applied that depends on the accuracy of the 
relative astrometry. 
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2.2. 



Astrometric Corrections 



Given the set of matched pairs in a mosaic, determined with some tolerance in separation, we 
next adjust the relative astrometry of the images that make up the mosaic, in order to reduce 
the separations of sources in the pairs. The traditional approach is to apply corrections to the 
World- Coordinate System, which is often very expensive computationally, and involves many 
trigonometric function evaluations. Here we choose a different approach, which is faster to 
calculate, and can accomplish multiple objectives in just one step. It is common practice 
to consider translations and rotations of the images in their X-Y plane. Our approach, 
however, is to use three-dimensional rotations on the celestial sphere. Such transformations 
can account for both rotation and translation locally in the tangent plane. When the axis 
of the 3-D rotation is parallel with the pointing of a given image, the transformation indeed 
corresponds to a 2-D rotation. But if the axis is perpendicular to the pointing, the 3-D 
rotation results in a translation in the tangent plane. If we allow for any direction, a single 
transformation can describe a combined effect. Here we work with 3-D normal vectors, which 
is often the preferred way in spherical calculations. 

We first consider an idealized problem where a single image is rotated in 3-D to minimize 
the separations of its sources from those matched to a fixed reference image. Let r*j represent 
the direction of the ith detection in the image, and let q be a matching reference direction. 
We can form a set of q) pairs to be used in the astrometric correction. Now we have to 
solve an optimization problem for the 3-D rotation R. The transformed position would be 
r' = Rr. Hence the optimization formally is 



where Wi is the precision parameter that is related to the of accuracy via Wi = l/of . Much 
like the potential energy stored in springs, our cost is quadratic in the displacements, and 
can be thought of as a system of springs, where the spring constants are proportional to the 
Wi weights. The solution is the equilibrium position of the sphere with a fixed center, which 
is given by the R rotation matrix. 

This general optimization, however, is computationally expensive. For this reason, 
working with orthonormal matrices is not practical. But for small corrections, infinitesimal 
rotations are ideally suited. Infinitesimal rotations have many advantageous mathematical 
properties over general 3-D rotations. For example, they are commutative. Let us represent 
the axis and the angle of the rotation. The axis is defined by the direction and the angle is 
the length of the vector. The infinitesimal transformation is then 




(1) 



(2) 
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Considering that the change Ar = u> x r is perpendicular to r and small in amplitude, the 
resulting vector r' stays approximately normal to quadratic order in Ar. The optimization 
problem with the infinitesimal rotation becomes analytically tractable. The cost function, 
whose minimization yields the u> estimate, is now not only quadratic in the displacement 
but also in its uj parameter, 



u> = argmm 



^Wi^Ci - (ri + u) x n) j. (3) 

By requiring that the gradients equal zero, we obtain a 3-D linear equation 

Au = b (4) 

which is a result of the summations 

A = ^ Wi (l — r j <g> r») (5) 

i 

b = ^2wi(riX a) (6) 

i 

where I is the identity matrix, and eg) is the operator of the dyadic product. A derivation of 
Equation (j3J) is given in Appendix [K\ 

In keeping with the spring analogy described below Equation (jTJ), Equation (j3j) can be 
interpreted as a torque equation. Vector Cj can be considered an angular acceleration vector 
of a sphere at an early time after its release that is related to its angular displacement at 
this early time. The sphere that has springs connecting each point % of mass Wi located at 
position r*j to the fixed standard located at Cj. Matrix A is the moment of inertial tensor 
for sources that lie on the unit sphere that is subject to a torque given by b. Notice that in 
the equation for b, quantity u^Cj can be replaced by u>j(cj — which is the force due to 
the spring. 

Since A is a 3x3 matrix, Equation (J3J) has a simple analytic solution for Q, e.g., by 
applying of Cramer's rule for matrix inversion. In practice, the matrix inversion can be 
performed numerically by one of several possible methods. The cost of the aggregation is 
linear in the number of pairs, and the matrix inversion involves a small number of operations. 
Therefore, the computational overhead for this approach is likely to be close to optimal. 

So far we assumed that there is a set of {cj} calibrators in a fixed, perfect reference image, 
which is clearly not true in our case. Instead we have imprecise source positions contained in 
groups of overlapping observations, the mosaics. In each mosaic we want to correct every one 
of the images, but the combined minimization problem is not as simple as it is for a single 
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image. Since we are only concerned with relative precision, different heuristic approaches 
come to mind to work around the lack of a true set of reference positions. For example, 
one of the images could be singled out to act as a reference frame onto which all others are 
corrected. In a group of images, however, there is no guarantee that there is a single image 
with which all others would overlap, because a mosaic consists of friends of friends. Another 
option would be to use the average positions of a preliminary match. We follow a third 
approach where the correction of a given image is based on all the other images. One-by-one 
we consider each image and derive their corrections independently, effectively assuming that 
all the others are perfect. Hence we use them as calibrators. The matched pairs being applied 
to Equation are then all the pairs for the mosaic involving the image being corrected. 
Having derived the infinitesimal rotation for all images, we update the source positions with 
the appropriate corrections and repeat until convergence. The u) vectors are accumulated, 
i.e., summed up, for each image (cf. commutativity of infinitesimal rotations) and saved for 
future reference. This also enables us to safely work with clean samples of stars, but apply 
the correction for all sources afterwards. One can also use this information to correct the 
alignment of the images in their WCS headers to reflect the changes. 



2.3. Friends-of-Friends Source Aggregation 

Now we use the astrometrically improved coordinates to perform a final cross-identification. 
This time we apply a smaller tolerance than the initial crossmatch, and again produce 
matched pairs of sources, as discussed in Section [27T1 Once all close source pairs are identified 
across images, the next step is to run a friends-of-friends (FoF) algorithm, also known to 
statisticians as single-linkage hierarchical clustering, to link all nearby detections into singly 
connected graphs. Such clustering is very efficient and can be easily implemented in any 
programming environment. 

The FoF groups enumerate all the sources that can be linked together using a specified 
pairwise distance threshold. There is no guarantee, however, that this algorithm provides 
statistically meaningful matches. For example, by linking nearby sources along a line, we can 



poten tially construct very long chains whose ends are far away from each other (jEveritt et al. 



20111 ). These FoF matches are just considered preliminary and need to be studied further. 



We note that other hierarchical clustering methods also exist ( jEveritt et al.ll201ll ). whose 
performance depends on the nature of the data and the goals of the project. Here we look for 
clusters only to provide candidates for the cross-identification. Preliminary experiments with 
several off-the-shelf tools, including average and complete linkage, were performed on select 
areas without noteworthy accuracy or speed improvements. These other tools also provide 
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clusters with a wide range of sizes and shapes. For cross-matching, we are interested only in 
identifying isolated compact clusters. To accomplish this, we apply a Bayesian method to 
evaluate the likelihood of various possible clusters based on the separation of the members 
and the estimated positional errors. 



2.4. Comparing Configurations 

The question is whether breaking up the FoF groups into smaller ones could yield better 
models, and if the answer is yes, which configuration would the best. We can address the 
issue by applying a Bayesian model comparison. If D represents the entire set of n positions 
in a given FoF group, we can consider alternative hypotheses, where D is partitioned into 
disjoint components that correspond to separate objects. The baseline case is when we have 
a single component. But there could be as many components as the number of detections. 
In general, let G represent a hypothesis with V components with model directions {m^} for 
7 = 1. .T. Also let S 7 be the list of sources that are assigned to each component and _D 7 their 
measured directions such that 

r 

D = {r 1 ,r 2 ,...,r n } = \jD 1 (7) 

7=1 

and 

D 71 n D 12 = when 7l ^ 7z . (8) 



Similar to the problem of probabilistic cross-identification discussed by lBudavari fc Szalay 



( 120081 ). the likelihood of G can be derived from the astrometric uncertainties and the prior 
on the directions of the objects. In general, it is essentially just the product of the individual 
T matches 

p(D\G) = II / P{^i\G)\[Pi{ri\m^G) dm r (9) 

7=1 ?es 7 

The calculation is analytic if the uncertainty is modeled with the spherical normal distribu- 



tion (]Fisherlll953l ) 



Pi(r t \m„ G) = — — — exp fto^ • m 7 J (10) 

47T sinn Wi 

and the prior is isotropic 

p(m 7 |G) = J-5(|m 7 |-l), (11) 
where 5(-) is the Dirac 5-function. 
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We can derive the improvement over the case S when all detections are considered 



separate, cf. iBudavari fc Szalayl (120081 ). In the limit of high astrometric accuracies, w^l, 




the Bayes factor p(D\G) / p(D\S) becomes 

— r (n-jll^f- m 

where ipij is the angular separation between the and Vj vectors, and the unmarked i and 
j summations are over the corresponding 5 7 sets. Note that this equation is also valid for 
singleton groups with only one detection because the summation includes the i—j cases, for 
which the separation is by definition. 

Other subdivisions of an FoF group can be evaluated against each other using the 
corresponding Bayes factors, the ratios of their likelihoods, which can be directly evaluated 
from the previous formula. For example, comparing G\ and G2 yields 

Ba - TWhY (13) 

where the p(D\S) baseline case simply cancels out. When the value is 1, the case is indecisive 
but if Bx2 >1, the measurements prefer the G\ hypothesis; otherwise G2 is favored. 

While the statistical approach for selecting the best components is clear in principle, 
the problem is still impossible to fully solve in practice due to its combinatorial nature. For 
FoF associations of, say, only 20 sources, the computational cost of exploring all possible 
combinations is simply prohibitively large. Thus we resort to a greedy algorithm that essen- 
tially corresponds to a shrinking pairwise distance threshold. Each FoF group is represented 
as a connection graph, where the edges link the nearby sources. We build the initial graph 
of the group, and evaluate its T=l baseline likelihood. Next we repeatedly break the longest 
edge until the graph splits. We then evaluate the new hypothesis of these two components 
(r=2) against the baseline. If the new model is more likely, i.e., the Bayes factor is greater 
than unity, we save it along with (the logarithm of) its likelihood. Afterwards we repeat 
the same steps to further break the subgraphs until we end up with separate objects with 
no connections at all. The result is a set of possible associations that are better than the 
original. Out of these we can pick the most likely. 



3. Pipeline for the Hubble Catalog 



The HLA contains metadata extracted from HST images. They are stored in a commercial 
Microsoft SQL Server 2008 database. The overall design builds on the success of the Sloan 
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Digital Sky Survey's SkyServer archive and applies a number of its extensions developed for 
spatial searches and spherical geometry. In this section, we describe our implementation of 
the aforementioned algorithms that runs completely inside the database engine, which in 
turn provides efficient parallel execution and rapid access to derived data products. 

The HST pointings generally do not follow any particular tiling pattern on the sky. 
During its long life, the Hubble Telescope has taken over a half-million exposures date, and 
many of these overlap with others from different programs. We can determine the overlapping 
exposures based on their intersecting sky coverage. The geometry of all visits is stored inside 
the data base using the Spherical Library, along with its SQL routines and spatial indexing 



facilities (IBudavari. Szalay fc Feketdl2010l ) 



The HLA contains source lists that are based on applying combined drizzled image files 
to DAOPhot and Source Extractor software. They provide positional and other information 
for each detected source. We have carried out the crossmatching on the set of Source Extrac- 
tor source lists available from DR6 of the HLA for ACS/WFC and WFPC2 taken together. 
Furthermore, we only consider sources that are determined to be good quality by Source 
Extractor, which means that the FlagsB value is less than 5. The crossmatching process 
operates on each set of source lists in a mosaic. 

The absolute astrometry of the HLA images is determined by the pointing of accuracy 
of HST, which is about 1 to 2 arc-sec in each coordinate. The HLA attempts to correct 
the absolute a stro met rJ _ further by matching against three catalogs: Guide Star Catalog 2 
(GSC2), 2MASS, and the Sloan Digital Sky Survey (SDSS) catalog. The typical absolute 
astrometric accuracy of the HLA-produced images is ~0.3 arcsecs in each coordinate. The 
majority of images, about 80% of ACS/WFC, are corrected in this way. 



3.1. White-Light Sources and Images 

The source detections made in the HLA are obtained in two steps. First, all images within a 
visit (HST pointing) for a given detector are combined to produce a "white-light" detection 
image, the deepest image possible for the visit. Source Extractor is then run on the detection 
image to identify source positions. In the second step, Source Extractor is run on each 
"color" (single filter) combined image within the visit for each detector to find sources at 
the positions identified by the detection image. If a source is found at an indicated position, 



4 Description of Flags is at http://hla.stsci.edu/hla_faq.html 

description of HLA astrometry is at |http : / /hla . stsci . edu/hla_f aq . html#astrometry 
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then its properties are determined and stored in a database table. There are currently about 
49,000 source lists containing about 46 million source detections. 

Given the above procedure for extracting sources, it makes no sense to crossmatch them 
in different filters for the same detector and visit. The reason is that, for a given detector 
and visit, color sources are constructed to have exactly the same position when they exist in 
different filters. Therefore, crossmatching must be carried out across visits and detectors. To 
do this, for each detector, we need to use visit level white-light source lists and the geometric 
outlines for the visit level combined images. The geometric outlines are used to determine 
the mosaics described in Section 12.11 Both the white light source lists and the geometric 
outlines are determined from available source lists and image metadata that reside in the 
HLA database. 

Fig. [[] plots the distribution of the number of white-light images within a mosaic. For 
smaller mosaics, the results follow a power law decline in frequency with number of images. 
The fall-off occurs with an index of about -2.5 and approaches unity at around 50 images. 
There is, however, a long tail that extends beyond that to almost 2000 images. This tail 
involves many of the large HST programs, such as the Ultra Deep Field (UDF). Much of the 
computational challenge lies in the crossmatching within the mosaics of this tail. 



3.2. Mosaics and Source Pairs 



Mosaics are constructed by the application of the algorithms described in Section 12.11 The 
determination of mosa ics is efficiently acc omplished by using the Hierarchical Triangular 
Mesh indexing (HTM; iKunszt et al.l l2000l ) to determi ne white-light images that are c lose 
to a given image and by using the Spherical Library ( IBudavari. Szalay &: Feketd 120101 ) to 
determine whether the images actually overlap. The pipeline completes the cross-matching 
on mosaics sequentially. That is, the crossmatching of sources is carried out for one mosaic 
at a time. Within each mosaic, the pipeline determines pairs of white-light sources whose 
separations are less than 0.3 arc-sec. This threshold is required to accommodate some of 
the more significant systematic offsets between the images. Larger offsets are currently not 
considered in this study. This matching can be carried out efficiently in an SQL database 
that is indexed on RA and Dec. 
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3.3. Refined Astrometry 

Using the white-light source pairs, the pipeline carries out astrometric corrections of the im- 
ages within a mosaic using the algorithms described in Section 12.21 The astrometric correc- 
tion algorithm that uses infinitesimal 3-D rotations is implemented in the C# programming 
language and is integrated directly into the database. The query that selects sources for 
astrometric correction is limited to detections that are likely stars based on the morphology. 
Despite the risk presented by the resolved objects, we also experimented with using the both 
resolved and unresolved sources also, which has the advantage of having more sources. But 
we obtained similar global results in most cases. The details of the sample selection for the 
correction might change in the future. As described in Section |2T2"1 the astrometric correction 
procedure iterates through all images in a mosaic and corrects them one-by-one assuming the 
others are correct. The initial large angular separation limit is decreased over time. In this 
way, any possible wrong associations (due to the generous threshold) are later eliminated as 
the relative calibration of the images tightens up. The convergence of the procedure is fast, 
but can be affected by erroneous initial matches. To avoid such situations, we exclude at run 
time all the pairs whose angular separation grows too large during the iterative optimization, 
as opposed to shrinking as expected. The transformations are persisted in a database table 
and are available to future pixel-level corrections that can potentially improve the quality 
of a combined image. Using the refined astrometry of all images, the cross-identification is 
performed once more with similarly large thresholds, so the list of matching pairs of sources 
is as inclusive as possible. 



3.4. Best Matches 

The pairwise matches are quickly grouped into FoF clusters that are obtained by using 
the sorted tables with appropriate primary keys. These clusters then provide preliminary 
sets of matched sources, which might contain large chains that may then be split by the 
Bayesian method described in Section 12.41 This probabilistic splitting is carried out by an 
SQL stored procedure called the chainbreaker, which is also written in the C# language. 
It reads all the sources grouped into matches and builds a graph out of them. Using the 
refined astrometric positions, we can safely ignore large separations and start with a 0.1 
arc-sec linking length. For each match, the splitting procedure considers a number of cases, 
using the greedy algorithm of Section I2.4[ all the way to separate objects. Each result that 
proves to be better than the unsplit case is saved in a database table. 

For each match, the chainbreaker selects the most favorable case based on equations (1121) 
and f jl~3|) . and stores the results in database tables that are used for catalog construction. 
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The output of the stored procedure includes all the values that are required to link to the 
HLA database and to fetch any relevant details of the observations related to the sources. 

The astrometric correction, FoF and chainbreaker steps are conceptually intertwined. 
After completion of the entire procedure, one could analyze the astrometric uncertainties 
and revise the results by running the procedure repeatedly. This is a possible direction of 
further studies. A completely integrated scheme, where the chainbreaker is applied in each 
astrometric iteration, is impractical due to its computational cost. 

3.5. Catalog 

After the matching of the white-light sources is completed, we apply the available color 
(filter) level information for each white-light source and store the results in a database table 
for the catalog. The source catalog we construct then contains the information about the 
source detections in each filter of a visit. We also include in the catalog the cases of color 
non-detections, i.e., color dropouts, within a visit. Such non-detections can be inferred from 
the two-step source detection process described above, since we know which images were 
used to build the white-light detection image and which images contained a source detection 
at a given position. 

Consider the case that a user specifies a region-based (e.g., position and search radius) 
search of the catalog. Another form of non-detection arises in this case when no source 
is found within a white-light image that overlaps with this user specified region. All such 
white-light non-detections cannot be stored in a database, but instead are inferred at run 
time within the catalog search function we have developed. We therefore provide information 
about detections and two types of non-detections (color and white-light). 

The matches are a collection of color source detections and color non-detections (as 
described above) that lie sufficiently close together in the sky. (White-light nondetections 
do not belong to a match.) Each match consists of at least one detected source. Each match 
can be considered to be describing a single physical object whose properties are determined 
by the properties of sources in the match. For example, each match has a certain position 
associated with it (MatchRA, MatchDec) that is determined by its source positions. This 
position is considered to be the location of the object described by the matched sources. 

Not all matches involve different visits or detectors. For example, images that are 
crossmatched generally contain some sources do not match those in an image obtained from 
a different detector or visit. For other cases, the source could not be crossmatched because 
the image containing the source did not overlap sufficiently with an image from another visit 
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or detector. About 60% of the images with source lists could not be crossmatched with other 
images. The sources in these images are also included in the catalog for completeness. In 
this case, the matches only involve a single detector and visit. 

The crossmatching of the HLA DR6 ACS/WFC and WFPC2 source lists was carried 
out with SQL Server 2008 running on a single (aging) Dell PowerEdge 2950 server with 
Intel Xeon E5430 @ 2.66Ghz (2 processors) and 24GB of memory. The entire pipeline for 
matching the sources and building the catalog runs automatically, and takes a few days to 
complete. 

4. Results 

The final catalog is comprised of entries for over 45 million sources detections and 15 million 
color dropouts for ACS/WFC and WFPC2. The catalog indicates which sources match 
together. Among the matches that involve more than one visit, there are on average 5.4 
detected (color) sources, which on average involve 3.8 different visits. The left panel of Fig. [2] 
plots the cumulative number of detected sources as a function of the number of detected 
sources in a match. About 40% of the sources are in matches with only one detected source. 
About 30% of these cases involve non-overlapping or insufficiently overlapping images. About 
20% of the sources are in matches with 5 or more detected sources. About 2% of the sources 
lie in matches with 50 or more sources. The right panel of Fig. [2] is similar to the left panel, 
but restricted to crossmatched sources. In this case, there are no matches with less than 
2 sources, as required for crossmatching. All these sources are in images that have been 
astrometrically corrected by our algorithm. About 6 million cross-identified detections lie in 
matches with more than 10 sources. 

For each set of matching sources, we determine the standard deviation of the positions 
among the white light sources involved in the match, where more than one white light source 
is involved. (The white light sources are used because they are involved in the crossmatching 
and are statistically independent, unlike the color sources.) The distribution of the standard 
deviations for all matches involving more than one white-light source is shown in Fig. |3j 
The lower curve is based on the the current HST astrometry for these matching sources, i.e., 
positions before our astrometric correction is applied. The upper curve is based on positions 
after astrometric correction. The median before astrometric correction is 32.5 mas. The 
peak (or mode) of the astrometrically corrected distribution is 3.0 mas, and the median is 
9.1 mas. 

In order to test the validity of these matches, we compare the values of the fluxes 
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determined by Source Extractor within radii of 3 pixels of the source centers (FluxAper2) 
for pairs of source detections in the same match that have the same instrument, detector and 
filter. No constraints are applied on the exposure times of the sources in the pairs. Apart 
from variable objects that are rare, we expect the flux difference to be small if the match 
corresponds to a single object. Flux differences may also be caused by detector degradation 
over time. We do not attempt to account for this effect in this paper. 

We define the fractional flux difference between source pairs i and j having fluxes Fi 
and Fj to be 

With this definition, we have that < < 1. In Fig. HI we plot in solid lines the distributions 
of / for ACS/WFC and WFC2 for pairs in the same match that have the same instrument, 
detector and filter. We compare these results to corresponding cases where pairs are not 
matched. To carry out this comparison, we considered the same set of pairs as those used in 
the respective solid lines and mapped the sources in each pair to a random pair taken from 
the same pair of images. The figures show that the matched pair distributions have strong 
peaks near f = 0, while the randomly selected pairs have a very broad distribution with a 
mild peak near f = l. The results provide evidence that these matches contain repeated 
observations of the same physical object. 

A more detailed view of the flux differences for matched pairs is shown in Fig. [5j The 
width of the ACS flux distribution, based on where normalized distribution equals 0.5, is 
equal to 0.027 and the corresponding width of the WFPC2 flux distribution is equal to 
0.047. The smallness of the widths reflect the accuracies of both the matching and the 
HST photometry. Since no constraints are applied to the exposure times of the matched 
pairs, some of the larger flux differences likely involve detections of faint sources with short 
exposure times. In some other cases, the flux differences may reflect physical changes in the 
source brightness. 

We determine effectiveness of the chainbreaker in splitting unphysical matches by again 
considering fractional flux differences between pairs of matching sources that have the same 
instrument, detector and filter. Fig. [6] shows the comparison of the distributions before 
and after the application of the chainbreaker. For the case of ACS/WFC, we see that the 
chainbreaker was effective in reducing the incidence of unphysical matches in the tail of the 
distribution. We note that the plot is semi-logarithmic and the apparent offset between the 
two curves involves only a small fraction of the matches. A smaller improvement was made 
for the case of WFPC2. 

We consider the time distribution covered by the matches. This time coverage of a 



- 16 - 



match is defined as the difference between the earliest start time and latest stop time for the 
exposures containing detected sources in the match. The largest match time span is 6419 
days or about 17.5 years. The time span distributions are plotted in Fig. [7J The number of 
matches greater than some time span falls off roughly logarithmically with the time span. 
Over 1 million matches involving about 10 million source detections have a time span greater 
than a few days. About 7 x 10 5 matches involving about 5 million source detections have 
a time span greater than a year. About 10% of the detected sources lie in matches with 
a duration of more than a year. About 30% of the crossmatched source detections lie in 
matches that have a duration of more than one year. 

Many of the matches involve more than one filter. Fig. [8] shows that more than one 
million matches involve multiple filters. Matches involve as many as 21 filters of detected 
sources and as many as 27 filters of detected sources and color dropouts. More than 7 x 10 4 
matches involve more than 5 filters for detected sources and more than 3.9 x 10 5 matches 
involve more than 5 filters for detected sources and dropouts. Fig. [9] shows how the frequency 
of matches depends jointly on the number of filters and the time duration of a match. 
The minimum match duration considered is one day which implies that all matches being 
considered involve more than HST visit. The frequency of matches extends to long time 
spans for matches with less than 5 filters. There are bands of many-filter cases at several 
time spans. 



5. Summary 

We developed a new approach for positional cross-matching astronomical sources that is well 
suited to the Hubble Space Telescope (HST). We applied the approach to crossmatching HST 
sources detected in the same or different detectors (ACS/WFC and WFPC2) and filters. 

The HST observations comprise a unique astronomy resource. Preserving the measure- 
ments and enhancing their value are important but challenging tasks. While the overall 
volume of data is moderate by today's standards, the dataset presents a number of diffi- 
culties. One of them is the small field of view that does not contain enough calibrators to 
accurately pin down the astrometry of the images. We presented a new algorithm that can 
cross-calibrate overlapping images to each other. We introduced infinitesimal 3-D rotations 
for this purpose, which yield an analytically tractable optimization procedure. Our im- 
plementation of this scheme is sufficient to provide high-precision relative astrometry across 
overlapping images. The improved astrometric accuracy of source positions is typically about 
a few milli-arcseconds (see Fig. [3]). Another challenge is to deal with the complex placement 
of the images on the sky and the fact that certain parts of the sky are observed many 
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times (see Fig. [T]). Instead of the naive combinatorial scaling, we achieve high efficiency in 
a greedy "chainbreaker" procedure that applies Bayesian model selection to find the best 
matches within a mosaic. 

To check on our matching results, we analyzed the flux differences between sources in 
the same match with the same instrument, detector and filter, which should optimally be 
zero (ignoring variable sources). We found that the astrometric correction and the prob- 
abilistic object selection provide reliable matches (see Fig. @J. Based on just positional 
information, the Bayesian model selection rejects spurious matches and improves the tail of 
the distribution with large flux deviations (see Fig. [6]). 

We demonstrated that many of the matches cover a broad range of time spans and fil- 
ters (Figs. HE! and[H])- Therefore, the catalog should enable time-domain, mult i- wavelength 
studies of sources detected by HST. The catalog also provides information about nonde- 
tections. The presented catalog is publicly available online^ via Web forms as well as an 
advanced query interface. 

The catalog provides a basis for further extensions, such as by including other detectors 
and source lists based on other software. The algorithms and tools developed in this paper are 
not specific to the Hubble Space Telescope. They are directly applicable to any astronomy 
observations that exhibit similar challenges. 
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APPENDIX 

A. Derivation of the Astrometric Correction 

In this section we provide a short derivation for Equation (J3J) using a variational method. 
This approach to the minimization is equivalent to writing out the components of the vectors 
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and matrices to obtain the vanishing partial differentials, but more concise. The quadratic 
cost function is 

" 12 (Al) 



(A2) 



C(u>) = ^ w i [ c * -(»*» + « x r<) 

Its minimization yields the value of uj defined by Equation (J3]), that is, 

Cj = argminC(o;) . 



CJe 



To determine the uj solution, we consider small variations in C(oj) for a> ~ cj and require 
that they vanish. This procedure is equivalent to requiring that the partial derivatives of 
C{u>) with respect to the components of cj vanish at cj = uj. The linear variation of the cost 
function due to a small Suj change is 



6C(u>) = -2 ^2 w i [ c i - {ri + u x r<) ■ (5uj x n) 



(A3) 



We use the fact that the terms in a scalar triple product can be cyclically permuted as in 

(cj - ri) ■ [5uj x n) = 5uj ■ [n x (cj - r 4 )] , (A4) 

and the known equivalence for the vector triple products, in addition to that for the dot 
products of two cross products, 



(cj x n) ■ (5uj x n 



5cj ■ [r$ x (u> x r, 
-5cj • [n (r< ■ uj) - r\ cj] . 



Since all are unit vectors, i.e., r i = l ) Equation (1A3|) can now be written as 

5C(U>) = -2 <5cJ ■ ^ W i [ r i X ( C * ~~ r ») + r i ( r i * W) - U) . 

i 

By requiring that 5C(cj) = for arbitrary, but small 5uj, we have that 



(A5) 
(A6) 



(A7) 



^wAriX (a - n) + ri (ri ■ uj) - uj 







(A8) 



The cross-product of any vector by itself vanishes, so rjx(cj — r*j) = rjXCj, and the final 
result is 



^2 W i ( r i X C i) + ( r i ® r i - I) CJ 







(A9) 



Symbol I represents the identity and £g> is the dyadic vector product, hence the term in 
parenthesis is a linear operator applied to the uj vector. This formula is equivalent to 
Equation (j3J) in the main body of the article. 
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Fig. 1. — Log-log plot of the number of mosaics versus the number of white-light (combined- 
filter visit-level ACS/WFC and WFPC2) images per mosaic. The line follows a power law 
with index -2.5. Some popular HST programs are indicated and lie in the long tail of the 
distribution. Crossmatching within these cases provides a computational challenge. 
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Fig. 2. — Left: Semi-log plot of the cumulative number of detected color sources as a function 
of the number of detected color sources in a match. Right: As in left panel, but only for 
sources that are crossmatched with sources in other visits or detectors. 
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Fig. 3. — Plot of the distribution of the standard deviations of the white light source positions 
per match for matches with more than one white light source. The upper (lower) curve is 
after (before) astrometric correction. The lower curve is the current relative astrometry 
provided by HST images. 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

fractional flux difference fractional flux difference 



Fig. 4. — The solid lines plot the distribution of the fractional flux differences between 
pairs of sources (defined in Equation (fT4"l) ) that are in the same match and have the same 
instrument, detector and filter. The bin size is 0.01. The left panel is for ACS/WFC and the 
right panel is for WFPC2. Plotted as dashed lines are the distributions with pairs selected 
randomly from the same images as each of the pairs used for the plots in the solid lines. 
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Fig. 5. — Plot of the distributions of the fractional flux differences between pairs of sources 
(defined in Equation ( I14p ) that are in the same match and have the same instrument, detector 
and filter. The bin size is 0.01. The vertical scale is normalized to unity for zero flux 
difference. 




Fig. 6. — Semi-log plot of the distribution of the fractional flux differences between pairs of 
sources (defined in Equation tHM ) that are in the same match and have the same instrument, 
detector and filter. The bin size is 0.01. The upper (lower) lines are the results before (after) 
the application of chainbreaker tool that splits chains of matches that are loosely connected. 
The lower curves are the same as those of Fig. HI but plotted on a semi-log scale. 
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Fig. 7. — Left: Semi-log plot of the number of matches that have a time span greater than 
some value. Right: Semi-log plot of the number of sources in matches that have a time span 
greater than some value. 
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Fig. 8. — Semi-log plot of the number of matches that have more filters than some value. 
The triangles refer to the number of filters for detected objects. The circles refer to refer to 
the number of filters for detections and color dropouts. 
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Fig. 9. — Plot of the frequency of matches as a function of the number of filters in a match 
and the duration of a match for matches with duration greater than 1 day. Only detected 
sources are considered. The frequency of matches is color coded and taken as the log 10 of 
the number of matches in a bin of 1 filter x 1 day. 



